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Abstract 

A new format for storing sparse matrices is proposed for efficient sparse matrix-vector (SpMV) product calculation 
on modern throughput-oriented computer architectures. This format extends the standard compressed row storage 
(CRS) format and is easily convertible to and from it without any memory overhead. Computational performance of an 
SpMV kernel for the new format is determined for over 140 sparse matrices on two Fermi-class graphics processing units 
(CPUs) and the efficiency of the kernel, which peaks at 36 and 25 GFLOPS at single and double precision, respectively, 
is compared with that of five existing generic algorithms and industrial implementations. The efficiency of the new 
format is also measured as a function of the mean (/x) and of the standard deviation (a) of the number of matrix nonzero 
elements per row. The largest speedup is found for matrices with /i > 20 and /a > a > 1.5 and can be as high as 43%. 

Keywords: SpMV, GPGPU, CRS, CSR, parallel computing, CUDA 
PACS: 02.60.-x, 07.05.Bx 
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1. Introduction 

The sparse matrix- vector (SpMV) multiplication is one 
of the most important kernels in scientific computing with 
numerous applications ranging from sparse linear solvers 
to the PageRank algorithm used by Google in its Web 
search engine |l|. However, its high memory bandwidth 
requirements combined with the poor data locality exhib- 
ited by typical sparse matrices result in poor performance 
on general purpose processors which usually attain only 
a small fraction of their peak performance in this kernel. 
The literature devoted to SpMV optimization techniques 
on traditional, cache-based processor designs is ample (see 
;[2j for an extensive overview) , and currently one of the ma- 
jor issues is how to exploit new features available in multi- 
core hardware. Several SpMV optimization strategies were 
already proposed for various designs of chip multiproces- 
sors, including multicore central processing units (CPUs) 
from Intel and AMD 0], single- and dual-socket STI Cell 
(3, Sun UltraSPARC T2 0, field programmable gate ar- 
rays (FPGAs) [J,|5j, and graphics processing units (GPUs) 

In just a few years GPUs have evolved from fixed- 
pipeline application-specific integrated circuits into highly 
programmable, versatile computing devices. Under con- 
ditions often met in scientific computing, modern GPUs 
surpass CPUs in computational power, data throughput 
and computational efficiency per watt by almost an or- 
der of magnitude. They are not only the core ingredi- 
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ent of many today's and tomorrow's petaflop supercom- 
puters, but also an affordable mathematical accelerator 
for everyday usage, with peak computational performance 
matching that of the most powerful supercomputers of 
only a decade ago. These devices can be programmed 
in high-level programming languages, e.g. Nvidia's C-\ — \- 
for CUDA (proprietary) or OpenCL (open standard), and 
several high-quality mathematical libraries as well as bind- 
ings to many high-level programming languages are avail- 
able, including C, C++, Python and Matlab. The major 
problem in their usage is software: any new hardware ar- 
chitecture will succeed only if appropriate software can be 
developed to exploit the parallelism in the hardware effi- 
ciently. However, the current multicore architectures are 
so diverse and are subject to so frequent changes that to 
exploit their potential the applications must be highly spe- 
cialized and use architecture-specific optimization strate- 
gies. Moreover, massively parallel architectures, like the 
one utilized in modern GPUs, require to use a completely 
new programming paradigm, which in turn requires rad- 
ical rethinking of how numerical computations should be 
performed. Since the vast majority of the existing scien- 
tific software adheres to "old" programming paradigms, 
and since development of this software took many, many 
millions man-hours, perhaps the best way to introduce new 
concepts is by concentrating on the most important kernels 
and showing their usability in the "old" environment. 

In this paper we present an efficient SpMV algorithm 
optimized for throughput-oriented processors and exam- 
ine its implementation for Nvidia's CUDA-enabled GPUs. 
The algorithm is based on a new sparse matrix storage for- 



Preprint submitted to Elsevier 



February 26, 2013 



mat designed as an extension of a popular compressed row 
storage (CRS, also known as compressed sparse row, CSR) 
format. It is characterized by a small memory footprint 
and small conversion times to other storage formats, which 
should facilitate its adoption in existing applications. It 
also turns out to be among the fastest SpMV algorithms 
available for GPUs. These features make it a candidate for 
a universal method of doing the SpMV multiplication ef- 
ficiently on a wide range of throughput-oriented multicore 
architectures. 

We compared the efficiency of our GPU implementa- 
tion of the new format with that of several generic GPU 
SpMV algorithms as well as industrial libraries. However, 
in contrast to many recent studies on GPU SpMV, which 
were based on small sets of sparse matrices, which in turn 
almost reduces the papers to case studies, here we use a 
far larger set from the University of Florida Sparse Ma- 
trix Collection (UF SMC) [ij|. This enabled us to make 
some general statements not only about the absolute per- 
formance of our CMRS-based implementation, but also 
about relative performance of several other alternative so- 
lutions. We also try to exploit on a broader scale some 
general characteristics of any SpMV product implementa- 
tion, e.g. the bandwidth efficiency, in the hope that our re- 
sults will be valuable even after the next-generation GPU 
architectures have been released. 

The structure of the paper is as follows. In Sec. [2] we 
introduce the problem of SpMV product calculation and 
briefly describe specific computer architecture features we 
use to solve it. Section [3] contains the definition of the new 
sparse matrix storage format, and Sec. U describes its im- 
plementation for Fermi-class GPUs. Results of extensive 
performance tests of the new implementation and its com- 
parison with other, alternative solutions is given in Sec. [5] 
Finally, Sec. [5] is devoted to conclusions. 

2. Problem statement 

2.1. SpMV multiplication 

The aim of the SpMV multiplication algorithm is to 
calculate the product y = Ax, where A is a large sparse 
matrix and x, y are dense vectors. Typical matrices in- 
volved in the SpMV product have thousands or even mil- 
lions of rows and columns and are very sparse, with an 
average number of nonzero elements per row rarely ex- 
ceeding 100. In some cases the pattern of nonzero ele- 
ments is quite regular (e.g. for problems resulting from 
discretization of partial differential equations on regular 
grids), but the most interesting — and difficult — matrices 
are those whose distribution of nonzero elements appears 
to be random. 

Nonzero elements of A are usually stored in a one- 
dimensional auxiliary array, and additional information is 
needed to uniquely map the values in the array to their lo- 
cations in A. The way this information is stored is called 
a sparse matrix format. Calculation of a sparse matrix- 
vector product essentially reduces to many "multiply and 



add" operations, which in modern GPUs are implemented 
as a single fused multiple-add (FMA) instruction. Thus, 
SpMV multiplication involves several memory accesses per 
each arithmetic instruction, which means that the SpMV 
kernel is inherently memory-bound. For example, server- 
class Tesla M2090 GPU can perform w 3.3 x 10 11 FMA 
operations per second and can access its main memory at 
w 1.7 x 10 11 B/s, which yields the ratio of about two opera- 
tions per byte. Therefore, if four eight-byte long operands 
of each FMA instructions were to be read from and writ- 
ten to the off-chip memory, the memory bottleneck would 
restrict the processor computational efficiency to 1/64 of 
its peak theoretical value. Although this value can be in- 
creased by using fast on-chip memory, other factors, like 
additional memory transactions necessary to read sparse 
matrix format data or reduced off-chip memory through- 
put due to poor data locality and indirect addressing, can 
decrease it to even smaller values. The main challenge 
in designing an efficient and universal SpMV algorithm 
is thus how to exploit and balance all the performance- 
related features available in hardware, focusing on the 
memory subsystem utilization. 

2.2. GPU architecture and the CUD A programming model 
The architecture of modern GPUs is a massively paral- 
lel design which excels in computing-intensive stream data 
processing. Here we briefly discuss the main properties 
of Nvidia's "Fermi" GPU architecture 16] introduced in 
2010. 

A GPU contains a number of units called multiproces- 
sors, each one containing a set of relatively simple com- 
puting cores called CUDA cores. From the programmer's 
perspective, multiprocessors are essentially independent 
single-instruction multiple data (SIMD) devices in which 
a group of 32 CUDA threads, called a warp, executes the 
same instruction on multiple data simultaneously. Mul- 
tiprocessors are connected to high bandwidth (up to w 
200 GB/s), high latency (« 800 clock cycles), limited size 
(<6 GB) external dynamic random access memory through 
a coherent L2 cache (768 KB). Each multiprocessor has an 
LI cache (16 or 48 KB), a read-only texture cache (6 to 
8 kB) and a constant cache (8 KB). Beside these hardware- 
managed caches, GPUs contain also several fast on-chip 
memories managed in software: so called shared memory 
and registers. 

While CPUs are equipped with sophisticated hardware 
mechanisms, like complex cache memories, branch predic- 
tion and out-of-order execution, which allow them to ex- 
ecute efficiently even a poorly written code, GPUs con- 
sist of relatively simple units which must be controlled on 
much lower level to achieve good performance. This makes 
GPU programming harder, but at the same time paves 
the way for new optimization strategies. These strate- 
gies require detailed information on physical capabilities, 
limitations and mutual dependencies of various GPU com- 
ponents lg, |l7[. For example, the size of the LI cache is 
guaranteed to be configurable at runtime (48 or 16 KB), 



the constant cache is bound to a 64 KB read-only mem- 
ory space residing in global memory, and the texture cache 
can be assigned to any area in global memory at runtime. 
The shared memory (16 or 48 KB) is shared by all threads 
belonging to well-defined groups of threads (called blocks) 
executing on the same multiprocessor. In addition, each 
multiprocessor has a pool of 2 15 32-bit registers that can 
be combined in pairs for double precision arithmetic. Reg- 
isters form the fastest memory (« 8 TB/s for GTX 480 
17|,ll8|), as the CUDA cores can operate directly on them. 
Next comes the shared memory (w 1.3 TB/s), then the 
global memory (w 170 GB/s), and finally the host CPU 
memory connected to the GPU through a PCI-Express bus 
(« 5 GB/s). Various memories available in GPUs differ 
not only in their speed, but also in latencies. For example, 
the latency of registers (which is caused by register de- 
pendencies, e.g. when an input operand is written by the 
previous instruction) is ~ 20 clock cycles, whereas the la- 
tency of the global memory accesses can be as high as 800 
clock cycles. To hide such high latencies, each multiproces- 
sor loads into its registers the states of up to 1536 threads 
(forming up to 48 warps) and attempts to execute the warp 
that has all operands ready for execution. This leads to 
massive parallelism with up to 24 576 threads being pro- 
cessed on-chip simultaneously. Another factor crucial for 
GPU efficiency is the memory access pattern. For exam- 
ple, shared memory is divided into 32 equally-sized banks 
and whenever two or more addresses of a memory request 
fall in the same memory bank, the access has to be seri- 
alized, which can degrade the shared memory bandwidth 
by an order of magnitude. Similarly, the global memory 
can be read or written at full speed only when the mem- 
ory accesses can be coalesced into 128-byte transactions 
matching 128-byte aligned segments in device memory. 

CUDA is an abstract general purpose parallel comput- 
ing architecture, programming model, and programming 
environment designed for Nvidia's GPUs |17j]. It is based 
on a few key concepts such as groups of threads (arranged 
hierarchically in warps, blocks and a grid), shared mem- 
ories and barrier synchronization. These concepts are ex- 
posed to the programmer through a minimal set of exten- 
sions to a high-level programming language (C or C-\ — h). 
Warps within a block of threads can be executed in any 
order; similarly, each block of threads can be run on any of 
the available multiprocessors in an arbitrary order, sequen- 
tially or in parallel. These features guarantee scalability 
of CUDA software — once compiled, it can be executed on 
various devices with different numbers of multiprocessors 
and can be easily ported to GPU clusters. 

To cope with frequent changes in the GPU architec- 
ture, Nvidia introduced the concept of compute capability 
(cc), which is a specification of the core hardware proper- 
ties of a given device, e.g. the maximum number of threads 
in a block or support for atomic operations. The "Fermi" 
architecture corresponds to cc 2.0 and 2.1. Similarly, there 
are several versions of CUDA programming environments, 
the current being CUDA 4.1, which differ in their support 



for compute capabilities and C++ language constructs, 
quality of libraries, etc. Even though subsequent CUDA 
versions are backward compatible with older ones, fre- 
quent improvements in the CUDA environment and hard- 
ware compute capabilities make it difficult to maintain the 
highest optimization level of the software — a strategy de- 
veloped for cc 1.1 may turn out counterproductive for cc 
2.0. This implies that while the CUDA technology has a 
great potential, it should be still considered experimental. 

2.3. Existing matrix formats for SpMV multiplication 

The simplest sparse matrix format is the coordinate 
(COO) format, in which the information about the row 
index, column index, and the value of each non-zero matrix 
element is stored in three one- dimensional arrays. Rowlnd, 
Collnd, and Val, respectively. As an example, consider a 
5x5 matrix: 
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Its COO representation (with zero-based indexing) reads 

Val = [1 2 3 4 5 6 7 8 9 10], 
Collnd = [0314242344], 
Rowlnd = [0011223334]. 

To complete the matrix definition, one also needs to supply 
three integers: the number of matrix rows (rows), columns 
(cols) and non-zero elements (nnz). 

In the above example the row-major ordering was used, 
i.e., the matrix index arrays were first sorted by row indices 
and then by column indices. In such a case array Rowlnd 
will typically contain sequences of many identical entries. 
This property is utilized in the CRS format to reduce the 
memory footprint by replacing array Rowlnd with a shorter 
array RowPtr. In the most general case this array is de- 
fined by the requirement that RowPtr [j + 1] - RowPtr [j] 
be equal to the number of non-zero elements in the j-th 
row (j = 0, . . . , rows — 1). If the matrix contains no empty 
rows, RowPtr [j] gives the index into Val corresponding to 
the first non-zero element in the j-th matrix row. Array 
RowPtr has exactly rows+1 elements and RowPtr [rows] 
= nnz. Thus, the CRS representation of M reads 

Val = [1 2 3 4 5 6 7 8 9 10], 
Collnd = [0314242344], 
RowPtr = [0 2 4 6 9 10]. 

Note that arrays Val and Collnd are the same as in the 
COO format. 

Let k be the maximum number of non-zero elements 
per row. In the ELLPACK/ITPACK (ELL) format an 
n x m sparse matrix is represented by two n x k dense 
arrays, Val and Collnd. Array Val is constructed from 



the original matrix by removing all zeros, while Collnd 
holds column indices into Val. The rows with less than k 
non-zero elements are padded in Val and Collnd arrays 
with and — 1, respectively. The ELL representation of 
M is thus: 
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While the ELL format belongs to the most efficient 
sparse matrix formats for vector architectures, it may in- 
volve a costly storage overhead. Several attempts have 
been made to modify this format so as to extend its practi- 
cal usability for general sparse matrices. One such attempt 
is the hybrid (HYB) format [6j, [7|, which is a combination 
of the ELL and COO formats. Another interesting idea is 
to divide the matrix into several slices, each represented 
separately in the ELL format, and/or use some kind of pre- 
processing, e.g. permutation of rows and columns 
to reduce padding. 



2.4- Existing GPU implementations 

One of the first efficient implementations of SpMV on 
GPU architecture was proposed by Bell and Garland [6j 
from Nvidia. They implemented SpMV kernels for several 
sparse matrix formats, including COO, ELL, and HYB. 
In addition, two SpMV kernels for the CRS format were 
provided: scalar and vector. The scalar kernel uses a naive 
approach by assigning one thread per matrix row, which 
leads to non-coalesced access to memory and poor perfor- 
mance. The vector kernel uses a 32-thread warp for each 
row which, on the one hand, provides contiguous access 
to the memory, but on the other hand it leads to a large 
bandwidth waste when the number of nonzero elements 
in a row is much smaller than the warp size. The tests 
indicated that the COO is not flexible enough to handle 
unstructured matrices efficiently. The ELL format is usu- 
ally faster than others, but fails whenever the number of 
nonzero elements in matrix rows varies significantly. The 
best overall performance was achieved for the HYB format, 
which offered the speed of ELL and flexibility of COO. 
Therefore the authors recommended HYB to be used as 
the fastest format for a broad selection of unstructured 
matrices. Their best kernel implementing SpMV on struc- 
tured matrices achieved the performance of 36 GFLOPS 
in single precision and 16 GFLOPS in double precision on 
a GeForce GTX 280 GPU. For unstructured matrices the 
authors reported the performance in excess of 15 GFLOPS 
and 10 GFLOPS in single and double precision, respec- 
tively. A slightly better performance was achieved for 
SpMV kernels tested on a GeForce GTX 285 GPU for the 
same set of test matrices [7J . 

Baskaran and Bordawekar 8] proposed a few optimiza- 
tion techniques to improve Bell's and Garland's SpMV ker- 



nels. By exploiting synchronization-free parallelism, opti- 
mized off-chip memory access, different thread mapping 
based on affinity, and by caching input vector elements in 
texture memory, they were able to outperform the HYB 
format in 9 out of 19 matrices on GeForce GTX 280 (with 
cache) and 7 matrices (without cache). 

The kernels investigated in [jj served as building blocks 
for CUSP [19j, an open-source library of sparse linear al- 
gebra and graph computations on CUDA. To improve coa- 
lescing of matrix data reads for sparse matrices in the CRS 
representation, if the average number of nonzeros per row 
is less than 32 this library virtually divides each warp into 

2, 4, 8 or 16 smaller parts, called vectors, and assigns them 
to different rows. 

Another direction of research on improving the effi- 
ciency of the SpMV kernel on GPUs focuses on various 
extensions and modifications of the ELL or CRS formats. 
This resulted in the development of the ELL-R [lCf , sliced- 
ELL [IH, ELLR-T Q, Sliced ELLR-T and rgCSR 
13j formats, as well as tiling and composite storage 14 1. 
Their major drawback, however, is that their usage in- 
volves a large memory overhead, either directly through 
zero-padding or indirectly through matrix preprocessing. 
This is hardly acceptable in serious numerical applications, 
as they typically require that the solver has the ability to 
handle large data sets. Since the GPU memory is rela- 
tively small and the connection to a larger CPU memory 
is slow, each of the bytes residing on the GPU must be 
used very cautiously. 

3. Compressed Multi-Row Sparse Format 



Efficiency of existing GPU implementations of the SpMV 
product is often significantly better if an ELL-based for- 
mats, e.g. the HYB format, is used instead of the CRS. The 
main reason for this is that GPUs are SIMD-like machines 
with relatively wide SIMD units, often far wider than the 
average number of nonzero elements per matrix row. Since 
efficient utilization of the CRS format requires the matrix 
elements to be accessed row by row, processing short rows 
in long SIMD-like units leads to wasting of the computa- 
tional capability of the device. Therefore, our main idea is 
to process a sparse matrix in chunks larger than individual 
rows, at the same time preserving the overall structure of 
the matrix representation typical of the CRS format. A 
group of rows processed by an individual SIMD unit shall 
be called 'strip', and the number of rows in a strip shall 
be called 'strip height' and denoted height. The number 
of strips, strips is thus equal to ceil (rows/height). 

The new sparse matrix format, which we call com- 
pressed multi-row storage (CMRS) format, comprises one 
integer parameter height and four arrays: data array Val 
and three auxiliary integer arrays, Collnd, StripPtr, and 
RowInStrip. Arrays Val and Collnd are the same as in 
the CRS format. Array StripPtr is a generalization of 
CRS array RowPtr and is defined by the requirement that 
StripPtr [j+1] - StripPtr [j] be equal to the number 



of non-zero elements in the j-th strip (j = 0, . . . , strips — 
1). If the sparse matrix contains no empty strips, StripPtr[j] 
gives the index into Val corresponding to the first non- 
zero element in the j-th strip. Finally, array RowInStrip, 
of length nnz, holds the row numbers within individual 
strips. 

Assume height = 2. Then the CMRS representation 
of M reads: 



Val = 

Collnd = 

StripPtr = 

RowInStrip = 



[123456789 10], 
[0314242344], 
[0 4 9 10], 
[0011001110]. 



Note that the conversion between the CRS and CMRS 
formats is trivial and easy to parallelize. In particular, 
StripPtr[j] = RowPtr[j * height] for j < strips and 
StripPtr[strips] = nnz, whereas RowInStrip [k] is the 
reminder of the CRS row number, RowPtr [k] , divided by 
height. It is also clear that both formats are equivalent 
if height = 1 (in which case RowInStrip should be set to 
NULL). 

4. Implementation 

Our GPU implementation of the SpMV product for 
matrices in the CMRS format is based on the vector kernel 
by Garland and Bell [7] , with rows replaced by strips. Each 
SIMD unit, or warp, is assigned a strip to process. The 
method of doing the SpMV product in a strip is presented 
in Algorithm [T] 

This algorithm is assumed to be executed in parallel 
by all threads in a warp. Implicit synchronization of the 
threads forming a warp is assumed. All auxiliary buffers 
and temporary variables are local to a warp, so no explicit 
synchronization of different warps is necessary, which al- 
lows for massively parallel processing of strips. The values 
of matrix elements and the corresponding column indices 
are read in parallel directly from arrays Val and Collnd, 
whereas row indices are assembled from the information 
hold in arrays StripPtr and RowInStrip. For sufficiently 
long strips these memory operations are coalesced to a 
high degree (j runs through consecutive matrix elements), 
and hence are very fast. Then the necessary elements of 
the input vector are fetched from the memory. This is the 
most sensitive part of each parallel SpMV implementation, 
as the vector elements required by a SIMD unit are often 
stored in memory locations scattered almost randomly in 
the memory, and hence their parallel processing is very 
problematic. Individual products of the matrix by vector 
elements are computed and stored in a buffer buf allocated 
in the fast on-chip shared memory. The buffer size is quite 
large, height • WARP_SIZE, as each thread needs its own 
memory buffer for each row. The exact mapping of thread 
lanes and row numbers into buf is arbitrary, as long as it 
is one-to-one, and affects the number of shared memory 
bank conflicts and the efficiency of the parallel reduction 



Algorithm 1 Processing of a strip in the SpMV kernel, 
A ■ x = b, for A in the CMRS format. This algorithm 
is executed in parallel by W_SIZE threads identified by 
thread_lane 6 {0, 1, . . . , W_SIZE — 1} and making up a 
warp. Parameter strip_id identifies the strip in the ma- 
trix. Suggested value of CMRS_BITS is 4. 



Require: Val, Collnd, StripPtr, height, x, y, 
strip_id < strips 

bufij <s— for all i,j 

M <— 2 CMRS - BITS 

strip.start <- StripPtr strlp _ ld 

strip.end <- StripPtr strlp _ id+1 

{j is current index into Val and Collnd} 

j <— strip.start + thread_lane 

while j < strip.end do 

{Load compressed values into register} 

c <— Collnd^ 

{Uncompress the value of RowInStripj} 

r <— c mod M 

{Uncompress the value of Collndj} 

C <r- [c/M\ 

{Threads update partial sums} 

buf t hread_lane,r 4— buf t hread_lane,r + %c ' "alj 

j <- j + W.SIZE 

end while 

{Parallel reduction of partial sums in rows} 



< 



buf ,r <- J2 



W.SIZE- 




buf,; 



0,..., height - 1 



row •<— strip_id • height + thread_lane 

{height elements of buf q contain row sums} 

if thread_lane < height and row < numjrows then 

?/row ^~~ buf o,thread_lane 

end if 



step. The mapping presented in Algorithm [T] is designed 
to minimize the latter factor. For example, for height = 8 
one can reduce 32 x 8 partial row sums in buf into 8 row 
sums using just 9 instructions. 

Our implementation needs height-fold more shared 
memory than in the vector kernel of Ref . [7j . On the one 
hand this is beneficial for the parallel reduction, but on the 
other hand it imposes a severe limit on acceptable values of 
height, as the buffer size in Fermi-class CPUs is restricted 
to only 48 KB. For example, if we take height = 8 and 
store the data as 8-byte numbers, 64 bytes of the shared 
memory will be needed for each thread, and so the maxi- 
mum number of resident threads per multiprocessor will be 
limited to 768, which is only half the number a multipro- 
cessor can handle. Taking into account that a large num- 
ber of resident threads is necessary to hide large memory 
latencies, we can safely assume that the maximum value 
of height in an efficient implementation for Fermi-class 
CPUs does not exceed 16. This, in turn, implies that the 
values stored in array RowInStrip are in the range 0. . . 15 
and hence can be encoded in just four bits. 

Recall that modern CPUs can support only up to 6 GB 
of memory. If we restrict our attention to square matrices 
that have at least one nonzero clement per row and col- 
umn, and if we take into account the storage necessary for 
the input and result vectors in double precision, we will 
find that the number of rows and columns of a sparse ma- 
trix for a problem that fits into 6 GB of memory is bounded 
by 2 28 . Therefore in our implementation we used 28 bits 
of Collnd [j] to store a column index and the remaining 4 
bits to store the corresponding value of RowInStrip. This 
made array RowInStrip superfluous and explains 'uncom- 
pression' steps in Algorithm [T] In this way the CMRS for- 
mat turns out to be even more compressed than the CRS 
one, i.e., it requires fewer bytes of storage. We believe that 
the SpMV kernel on GPUs is so much memory-bound that 
it is of utmost importance to reduce its memory footprint, 
even at the cost of several arithmetic operations, which in 
this kernel are almost free. 

We also implemented several optional speed optimiza- 
tions. The first one consists in buffering the input vec- 
tor in the texture cache [7( rather than in the LI cache. 
The second one consists in enlarging the shared memory 
size from 16 to 48 KB, at the cost of the LI cache size. 
The third optimization strategy, adapted form [8j, aims 
at improving the effective memory bandwidth for arrays 
Val and Collnd, for large \i = nnz/rows, by first access- 
ing the non-aligned portion of a strip and then accessing 
the remaining, aligned portion at full speed. The fourth 
one consists in reordering the elements of CMRS arrays 
so that the index array Collnd is first sorted by strip in- 
dices and then within the same strip by column indices. 
The idea behind such ordering is the same as for the row- 
major ordering in the CRS format: enhance the frequency 
of coalesced or cache-buffered accesses of a SIMD unit to 
the elements of the input vector. Note that most matrices 
coming from real problems have some internal structure 



Table 1: Theoretical peak capabilities of the devices used in tests 

GTX 480 C2070 

single prec. perform. [TFLOPS] 1.3 1 

double prec. perform. [TFLOPS] 0.17 0.5 

memory bandwidth [GB/s] 177 144 



and the locations of nonzero elements in neighboring rows 
are correlated. In such cases reordering the entries in the 
CMRS arrays can have a pronounced impact on SpMV 
efficiency. Assuming again height = 2, the CMRS repre- 
sentation of M after data reordering would read: 

Val = [1 3 2 4 5 7 8 6 9 10], 

Collnd = [0134223444], 

RowInStrip = [0101011010], 

StripPtr = [ 4 9 10 ] , 

Note that the reordering affects only the matrix internal 
representation and does not involve any actions on the 
input and output vectors. Moreover, reordering is local to 
strips and hence is prone to parallelization. 

5. Results 

5.1. Hardware and software specification 

The tests were performed on two Nvidia devices, GTX 
480 (1.5 GB) and Tesla C2070 (6 GB). The ECC mem- 
ory support in the Tesla device was switched off for a 
larger bandwidth. In both cases the operating system 
was a 64-bit Linux with Nvidia GPU driver v. 290.10 and 
CUDA 4.1. Theoretical peak capabilities of these devices 
are listed in Table [TJ 

5.2. Test matrices and methodology 

The tests were performed using 138 real matrices from 
the University of Florida Sparse Matrix Collection satis- 
fying 10 6 < nnz < 10 8 , including all matrices with nnz > 
5 x 10 6 , and three additional sparse matrices of our choice. 
The matrices are of various sizes and represent a wide spec- 
trum of applications and structure patterns. In particu- 
lar, our test case includes all matrices used in several re- 
cent studies on GPU SpMV performance @, @-[ll 0, EI | • 
The additional matrices include an artificial matrix, p7, 
which is a large (10 7 x 10 7 ) random permutation matrix, 
dense 10000, which is a dense matrices, and aorta, which 
is a sparse matrix representing the pressure equation in the 
problem of the flow through the human abdominal aorta 
[22l ] . We included p7 to get a better insight into the role 
played by structural correlations between adjacent rows 
and the impact of (un)coalesced accesses to the input vec- 
tor; denselOOOO is an example of a matrix which can be 
processed at the highest speed; and aorta is an example 
of a sparse matrix for which efficiency of the SpMV kernel 
is of utmost importance, as it directly affects the time to 



solution in biomedical applications. Except for rail4284 
used in [3, [l| s all test matrices are square. 

For each test matrix and each combination of optimiza- 
tion parameters, our CMRS implementation of the SpMV 
product was called 10 times and the execution times were 
recorded. The largest time was omitted (the GPUs were 
connected to the display, which could occasionally inter- 
fere with the measurement) and the remaining 9 results 
were analyzed to find their average and standard devia- 
tion. The optimization parameters included the strip hight 
(height = 2,4,8,16), block size (BS = 64?', j = 1,...,8), 
and four boolean parameters referring to the optimization 
techniques described in Sec. [H so that the total number 
parameter combinations for each matrix was 512. We then 
searched for the parameters that give the shortest SpMV 
execution time (brute- force search), t. These computa- 
tions were then repeated using three freely available library 
implementations of the SpMV routine on GPUs: Nvidia 
CUDA cuSparse 4.1 implementations for CRS and HYB 
formats and CUSP 0.2.0 implementation (CSR-tex) for the 
CRS format. Each library function was called using the 
default library configuration. Finally, we used the brute- 
force method to find the best possible SpMV times for two 
SpMV kernels: scalar and vector, as described in Sec. 12.41 
The scalar kernel was optimized with respect to the value 
of the block size and the usage of the texture cache. The 
vector kernel was optimized with respect to all parameters 
used for CMRS optimization. 

For each matrix the computational efficiency of the 
CMRS SpMV kernel was determined as the ratio of the 
number of elementary arithmetic operations to the SpMV 
kernel time, i.e. (2nnz — rows)/r. The bandwidth was cal- 
culated as the total number of bytes that had to be trans- 
ferred to or from the GPU main memory, /3, divided by r. 
We considered two extreme cases: the input vector either 
is not cached or is fully cached. The bytes transferred in 
each case, /3_ and /3+, respectively, are calculated from 



/3_ = 20nnz + 4strips + 8rows, 
f3 + = 12nnz + 4strips + 16rows, 



(2) 



and the bandwidth can be defined either as /3_/r or /3+/t. 
This leads to two definitions of memory utilization effi- 
ciency, r/±: 

(3± 1 

t B 

where B is the hardware theoretical bandwidth of the de- 
vice. Clearly, ij + < 1 for all CMRS SpMV implemen- 
tations and a value of ry_ > 1 indicates that the device 
efficiently buffers the input vector in its caches. 

5. 3. SpMV multiplication results 

Selected results for the CMRS SpMV kernel running in 
double precision are presented in Tab.[5J results for the re- 
maining matrices can be found in the ladditional matcriall 
This table contains the names and main properties of the 
test matrices, the best CMRS SpMV time, comparison of 



this time with the ones obtained using existing libraries 
or other well known algorithms, and analysis of the ker- 
nel efficiency in terms of the computational efficiency and 
memory bandwidth. The table is ordered according to 
the increasing number of nonzero elements per row, /i. 
Relative speedups against 5 other implementation are ex- 
pressed as dimensionless parameters. These 5 implemen- 
tations include two closed-source cuSparse 4.1 library im- 
plementations (CRS and HYB), one open-source library 
CUSP 0.2.0, and our implementations of two well-known 
CRS kernels, vector and scalar, as described in Sec. 12.41 

5.3.1. Artificial matrices 

We start our analysis from two artificial matrices: p7 
and denselOOOO. The SpMV kernels for these matrices 
could be implemented much more efficiently using other 
methods, but we use them as well-defined benchmarks for 
real- world sparse matrices. 

Matrix denselOOOO is a dense 10 000 x 10 000 matrix 
represented as a sparse matrix. The structure of dense 
matrices allows for fully coalesced memory accesses to all 
data, including the input vector, and efficient utilization 
of caches. This matrix yields the highest effective band- 
width, 258 GB/s, which is almost 50% larger than the 
theoretical peak hardware bandwidth (77- = 1.48 3> 1), a 
phenomenon resulting from buffering the input vector in 
device caches. Note that the previous generation GPUs, 
e.g. GTX 285, which had neither LI nor L2 caches, allowed 
for far less efficient data caching (??_ < 1.08) [7]. 

If one assumes perfect caching of the input vector, the 
bandwidth for denselOOOO drops to 155 GB/s, which is be- 
low the theoretical peak of 177 GB/s, but above 148 GB/s 
reported by the bandwidthtest utility program from the 
CUDA 4.1 SDK. This has several interesting consequences. 
First, the SpMV kernel is clearly memory-bound and the 
contribution to its execution time from arithmetic opera- 
tions, e.g. parallel reduction, is practically negligible. Sec- 
ond, the performance metrics obtained for this matrix 
(77+ = 0.87, 26 GFLOPS, 258 GB/s) can be regarded as 
upper bounds for any CMRS format implementations on 
the Fermi architecture and are very close to the physical 
limit (r/ + = 1). Third, even though Nvidia artificially de- 
creased four-fold the rate of double precision operations in 
consumer GeForce cards, this has practically no impact on 
the SpMV kernel performance. 

Two alternative SpMV kernels, HYB and scalar, per- 
form rather poorly for dense matrices. One reason for this 
is that these kernels assign one thread per matrix row. 
As the number of rows is rather small, the GPU cannot 
launch enough threads to hide memory latency. Moreover, 
the scalar kernel cannot coalesce memory accesses for ma- 
trices with high (i. 

Another artificial matrix, p7, constitutes another ex- 
treme case in which accesses to the input vector are totally 
uncoalesced. In such a case a warp must issue 32 (128-byte 
long) memory requests to fetch 32 doubles from the mem- 
ory, whereas a fully coalesced memory access would require 



Tabic 2: Selected matrices, their properties and corresponding double precision SpMV results for the CMRS format on Nvidia GTX 480, 
ordered according to increasing fi. See the text for column definitions. A dash ( — ) indicates that the corresponding SpMV library failed. 



matrix properties 




SpMV 




speedup against 




computat. 


memory util. 


name 


M 


nnz 

[xlO 6 ] 


rows 

[xlO 3 ] 


time 

[ms] 


CRS 


HYB 


CUSP 


vect. 


seal. 


perform. 
[GFLOPS] 


bandw. 

[GB/s] 


V- 


V+ 


p7 


1 


10.0 


10000 


11.12 


0.7 


0.6 


0.7 


4.7 


0.7 


0.9 


26 


0.14 


0.14 


asia_osm 


2 


25.4 


11951 


14.27 


0.6 


0.3 


0.4 


4.3 


0.3 


2.7 


43 


0.24 


0.20 


webbase 


3 


3.1 


1000 


1.47 


0.8 


0.4 


0.9 


3.8 


1.4 


3.5 


48 


0.27 


0.21 


lpl 


3 


1.6 


534 


3.26 


11.9 


0.1 


4.3 


1.0 


12.5 


0.8 


12 


0.07 


0.05 


mc2dpi 


4 


2.1 


526 


0.72 


0.7 


0.3 


0.5 


4.3 


0.4 


5.1 


65 


0.36 


0.26 


aorta 


5 


2.4 


497 


1.07 


1.0 


1.0 


1.1 


2.9 


1.0 


4.1 


49 


0.28 


0.20 


mac_econ 


6 


1.3 


207 


0.40 


1.3 


0.8 


0.9 


3.4 


0.9 


5.9 


69 


0.39 


0.27 


scircuit 


6 


1.0 


171 


0.31 


1.2 


0.6 


0.9 


3.7 


0.9 


5.5 


66 


0.37 


0.26 


cagel5 


19 


99.2 


5155 


14.29 


2.2 


— 


2.4 


2.1 


2.9 


13.5 


142 


0.80 


0.50 


cop20k_A 


21 


2.6 


121 


0.44 


2.0 


1.0 


1.9 


1.9 


3.7 


11.6 


121 


0.68 


0.43 


shipsecl 


25 


3.6 


141 


0.51 


2.1 


1.1 


2.0 


2.1 


3.1 


13.8 


143 


0.81 


0.50 


qcd5_4 


39 


1.9 


49 


0.23 


2.0 


0.8 


1.6 


1.7 


4.9 


16.4 


168 


0.95 


0.58 


rmalO 


51 


2.4 


47 


0.28 


1.6 


1.4 


1.4 


1.5 


4.1 


16.6 


169 


0.95 


0.58 


pwtk 


52 


11.5 


218 


1.11 


1.9 


0.9 


1.9 


1.6 


6.1 


20.5 


209 


1.18 


0.72 


cant 


64 


4.0 


62 


0.42 


1.6 


1.0 


1.4 


1.4 


4.9 


19.0 


193 


1.09 


0.66 


02-raefsky3 


70 


1.5 


21 


0.16 


1.5 


0.9 


1.3 


1.3 


4.8 


18.7 


190 


1.07 


0.65 


consph 


72 


6.0 


83 


0.62 


1.5 


0.9 


1.5 


1.4 


5.0 


19.3 


196 


1.11 


0.67 


Chebyshev4 


79 


5.4 


68 


1.68 


1.2 


0.7 


0.8 


0.6 


12.2 


6.3 


64 


1.05 


0.64 


pdblHYS 


119 


4.3 


36 


0.41 


1.3 


1.3 


1.3 


1.2 


5.4 


21.1 


213 


1.20 


0.73 


tsopf_rs_b2383 


424 


16.2 


38 


1.32 


1.2 


3.2 


1.1 


1.1 


6.5 


24.5 


245 


1.39 


0.83 


rail4284 


2633 


11.3 


4 


2.33 


1.4 


1.3 


1.0 


0.9 


11.4 


9.7 


97 


0.55 


0.33 


denselOOOO 


10 4 


100.0 


10 


7.76 


1.1 


— 


1.2 


1.0 


6.3 


25.8 


258 


1.46 


0.87 



only 2 requests. This leads to the memory bandwidth 10 
times smaller than for denselOOOO. Note that HYB and 
scalar kernels perform much better than CMRS, probably 
due to their better coalescing of matrix data fetches. 

5.3.2. Sparse matrices 

Of all test matrices from the University of Florida Sparse 
Matrix Collection, TS0PF_RS_b2383 yields the highest band- 
width, 245 GB/s, and the highest computational perfor- 
mance, 24.5 GFLOPS. These results are very close to those 
obtained for denselOOOO and hence to the physical limit 
(?7+ very close to 1), which means that for some real- 
world sparse matrices the CMRS format allows for almost 
full data coalescence and very efficient cache utilization 
(?7_ = 1.39). It's interesting to notice that Nvidia HYB is 
over 3 time less efficient for this matrix. 

Matrix lpl is the other extreme — the memory band- 
width obtained for this matrix, 12 GB/s, is over two times 
worse than for matrix p7, which indicates that our imple- 
mentation cannot coalesce memory accesses to the data 
stored in this matrix, let alone the input vector. Even 
more striking is the comparison of our implementation 
with Nvidia HYB and the scalar kernels — the former is 
9 times faster, and the latter runs 12.5 more slowly than 
our implementation. The reason for this strange behavior, 
leading to over a 100-fold (!) difference in the computa- 
tional efficiency between the HYB and scalar kernels, is the 



structure of lpl. The first row in this matrix has 249 643 
nonzero elements (w nnz/ 6), while the next longest row 
has only 99 nonzero elements. In the HYB algorithm most 
of nonzero elements in the very long row is represented in 
the COO format and processed by a group of warps. In 
our implementation of the CMRS format this row is pro- 
cessed by a single warp and in the scalar CRS algorithm 
it is processed by only a single thread. Therefore the effi- 
ciency of both CMRS and scalar kernels suffer from GPU 
memory latency, which in this case is very high (recall that 
a GPU needs to launch tens or even hundreds of warps to 
hide various memory latencies). 

Poor performance of the scalar kernel for lpl rises a 
question whether this kernel is of any practical relevance. 
We found only one matrix, asia_osm, for which the scalar 
kernel gives the shortest SpMV time. For this matrix it is 
w 3.5 times faster than our CMRS-based implementation 
and w 15% faster than the HYB implementation. 

Matrix Chebyshev4 is another interesting example. It 
contains 16 rows which are fully or almost fully filled with 
nonzero values, arranged in 8 pairs of adjacent rows. On 
the one hand, grouping a few adjacent rows into strips 
would lead to the same latency problems as with lpl. On 
the other hand, in the HYB algorithm these rows are pro- 
cessed in the COO format, which is less efficient than the 
vector CRS kernel. For this reason the best SpMV time 
for this matrix is achieved by the vector kernel. 



Matrix cagel5, which requires w 1.3 GB of storage, 
is the largest test matrix that fitted into the 1.5 GB of 
memory available in GTX 480. Our implementation deals 
with such a large matrix well and outperforms other im- 
plementations by a factor of 2 or more. In particular, since 
the CMRS format is an extension of the CRS format, it 
is possible to implement conversions between CMRS and 
CRS that use essentially the same storage. In contrast 
to this, the HYB and CRS format are completely different 
from each other and conversion between them requires two 
large, separate storages on the device. Since the only way 
to use CUDA 4.1 implementation of the HYB format is 
to load a matrix in the CRS format into the device and 
then convert it to the HYB format, this implementation 
cannot be used for matrices that occupy more than half of 
the available GPU memory. 

Finally, matrix aorta, which comes from a real prob- 
lem where the SpMV kernel efficiency is an essential issue, 
shows that the best results are obtained for three imple- 
mentations, including ours. However, the computational 
performance for this matrix, is rather disappointing (4.1 
GFLOPS, rj + = 0.20) and calls for further optimization. 

5.4- Comparison with other implementations 

The data collected in Tab. [5] are not representative for 
general sparse matrices and cannot be used for evaluation 
of different GPU SpMV algorithms and implementations. 
Instead, we used statistical analysis of the results obtained 
for 138 sparse matrices from UF SMC, assuming that this 
collection contains a representative sample of sparse ma- 
trices. 

A fair comparison of implementations alternative to 
ours must take into account that CRS, HYB and CUSP 
are complex implementations which exploit various simple 
SpMV kernels. Which kernel with which internal opti- 
mization parameters is actually used depends on the ma- 
trix structure, which may require some matrix preprocess- 
ing. These complex implementations have, essentially, no 
free parameters, but their efficiency could be increased 
by adjusting some internal parameters, e.g. the thread 
block size, to the particular matrix. On the other hand, 
scalar, vector and CMRS kernels are simple, irreducible 
algorithms, which can be considered as building blocks for 
more advanced implementations; they contain several op- 
timization parameters whose optimal values are found here 
by a brute-search method. 

We found no matrix for which the cuSparse 4.1 CRS 
implementation would give the shortest SpMV product 
time. Although for 15 matrices it turns out significantly 
faster than our CMRS implementation, each of these ma- 
trices is characterized by a low number of nonzero elements 
per row (/i < 5) and for each of them either the HYB or the 
scalar kernel turns out significantly faster (we adopt a con- 
vention that implementation A is significantly faster than 
B if and only if its execution time is at least 10% shorter). 
The CUSP implementation is generally even less efficient. 
The vector kernel could be easily included as a special case 



into our CMRS implementation, with height = 1. The 
fact that for some matrices, e.g. Chebyshev4, this kernel 
gives the fastest SpMV implementation, implies that for 
some matrices the optimal value of height is 1. It is also 
interesting to notice that the CMRS format can improve 
the efficiency of the vector kernel it is based upon by as 
much as 4.3 times (mc2dpi) and even by 4.7 times for the 
artificial matrix p7. 

The scalar kernel needs more attention. We found 21 
matrices for which it is significantly faster than our CMRS 
implementation, but only for 1 of them this kernel turned 
out to be faster than HYB. The scalar kernel exhibits good 
performance for matrices with a small number of nonzero 
elements per row (/i < 7) and its performance can be as 
good as 12.6 GFLOPS and 166 GB/s. However, in most 
of such cases the HYB implementation gives even better 
results. 

The most interesting is comparison of our algorithm 
with the cuSparse 4.1 HYB implementation. We found 
HYB to be significantly faster than our implementation 
for 48 matrices, and it is the fastest implementation for 46 
of them. On the other hand, our implementation is signifi- 
cantly faster than HYB for 55 matrices and for 41 of them 
it is the fastest implementation. Although for 5 matrices 
the HYB implementation is faster than ours from 4 to 9 
times, all these matrices have a special structure, i.e. they 
contain one or just a few rows with a very large number of 
nonzero elements (we will call them 'lpl-like matrices'). 
HYB implementations deals with such cases efficiently be- 
cause it preprocesses each matrix before the first SpMV 
routine can be called on it. Similar preprocessing could 
be implemented in the CMRS format, most likely with a 
similar performance boost. On the other hand, the HYB 
format could not be used for 16 largest matrices because 
of its memory requirements; for each of these matrices our 
SpMV implementation turned out to be the most efficient 
one. 

It is difficult to find a good criterion for an SpMV algo- 
rithm to be the most efficient for general sparse matrices. 
One such criterion could be the sum of single SpMV prod- 
uct execution times for all tested matrices. If we neglect 
in such comparison 5 lpl-like matrices and 16 largest ma- 
trices for which the HYB algorithm could not be run, we 
will get that these times are in the ratio 1.3 : 1.0 : 1.3 
: 2.2 : 3.7 : 1 for the CRS, HYB, CUSP, vector, scalar 
and CMRS kernels, respectively. This could be loosely in- 
terpreted that our CMRS implementation is on average 
as fast as the HYB kernel and faster than other kernels. 
Note that the CMRS format with its adjustable parameter 
height allows for, on average, « 2.2 times faster SpMV 
implementation than the vector kernel with height fixed 
at 1. 

Figure Q] depicts the speedup of our CMRS SpMV im- 
plementation against scalar, vector and HYB kernels as a 
function of the mean number of nonzero elements per row 
(/i) . Clearly /i is a relevant parameter for determining rel- 
ative performance of various SpMV implementations. The 
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Figure 1: (color online) The speedup of our CMRS implementation of 
the SpMV kernel against scalar, vector and HYB kernels as a function 
of the mean number of nonzero elements per row (/i). Results for 
lpl-like matrices are omitted for clarity. 



scalar and hybrid kernels give the shortest SpMV times 
for small ^i, but their efficiency decreases as \x is increased, 
with the scalar kernel being very inefficient for large \x. 
This property of the scalar kernel is related to its inability 
to coalesce data transfers if fi is large. The vector kernel 
behaves in just the opposite way: its efficiency relative to 
other kernels is very good for large fj,, but it decreases as 
[i drops below « 100. This is related to the fact that the 
vector kernel processes each row with a warp of 32 threads, 
and hence is most efficient for sparse matrices with n far 
larger than 32. Interestingly, for /i < 120 the speedup of 
the CMRS over the vector kernel can be approximated as 
6/i~ 0,35 , i.e. by a power law. 

From Fig. [Q it can be immediately seen that the CMRS 
format generally does not yield much improvement over 
the CRS format for fj, > 150, and tends to be system- 
atically slower than the HYB format for fj, < 20. Hence 
one expects that the advantages of the CMRS will be most 
pronounced for moderate values of /i. This is confirmed by 
Fig. 03 which depicts the speedup of our CMRS SpMV im- 
plementation against the best of all five alternative SpMV 
implementations considered here, calculated individually 
for each matrix. Using CMRS can improve the SpMV per- 
formance by up to 43%. However, at the moment there is 
no simple criterion that could be used to tell in advance 
whether using it can be of any benefit. Nevertheless, since 
the CMRS format introduces practically no overhead over 
the vector kernel, we suggest that for the Fermi GPU ar- 
chitecture one should use the CMRS format for matrices 
with [i > 70 and test its efficiency against the HYB format 
for 20 < n < 70. 

A quite large dispersion of the data in Fig. [Q suggests 
that fi is not the only parameter controlling the relative 
performance of various implementations. As a second such 
parameter we propose the standard deviation (cr) of the 



Figure 2: (color online) The speedup of our CMRS implementation 
against the best of all five alternative SpMV implementations for a 
given matrix. 
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Figure 3: (color online) The speedup of CMRS against HYB im- 
plementation as a function of cr. Only matrices such that a < ii are 
taken into account. Inset: enlargement of the plot for cr < 5. 



number of nonzero elements (rii) per row 



/ Y^ rows— t/ 



vf 



(3) 



Figure EH depicts the c-dependence of the efficiency of the 
CMRS-based implementation of the SpMV kernel relative 
to that of HYB. As could be expected, the HYB kernel is 
the most efficient for small values of a. In particular, for 
cr = all rows have the same number of nonzero elements 
and the HYB format reduces to the ELL format without 
any memory overhead. As a grows, the memory overhead 
of ELL begins to reduce the efficiency of HYB and for very 
large a an increasing part of the matrix is processed using 
the COO format, which is computationally inefficient. Our 
results suggest that as long as a < fi, the HYB format 
should be used if cr < 1.5 and the CMRS is more efficient 
for a > 3. Matrices for which a > [i are 'atypical', power- 
law [14J or lpl-like matrices for which the HYB format is 
preferable. 
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5.5. Optimization parameter choice 

Our analysis is based on a brute-force search in a fairly 
large parameter space, which is time consuming and cer- 
tainly impractical. This rises the question of whether it 
is possible to find the optimal set of parameters for the 
CMRS algorithm quickly? While a detailed answer is be- 
yond the scope of this research, there are several conclu- 
sions that can be already drawn from our results. First 
of all, the parameters are not only mutually correlated, 
but also dependent on other factors, like /i, which makes 
the problem hard. For example, since the value of height 
controls the shared memory required by the CMRS kernel, 
different settings for the shared memory size are required 
for height = 2 and for height = 16. While this particu- 
lar correlation can be easily related to the GPU architec- 
ture, other correlations are less obvious. Second, there are 
some regular patterns in the best-case sets of optimization 
parameters. For example, for all test matrices in double 
precision we found that the input vector should be cached 
in the texture cache rather than in LI. It is also clear that 
proper sorting of data elements, as described in Sec. 01 will 
never degrade the SpMV efficiency. Third, the CMRS for- 
mat was already implemented in the SpeedIT 2.0 library, 
and the average performance of the SpMV kernel, which 
uses some very crude heuristics to determine the optimiza- 
tion parameters, already lies above the performance of the 
CRS implementations from Nvidia. Finally, really chal- 
lenging applications of GPUs are in the area of high per- 
formance computing (HPC), where individual calculations 
often take from hours to weeks. If choosing between the 
HYB or CMRS format and then using appropriate inter- 
nal optimization parameters can decrease the computation 
time by a factor of 3, it should be fully acceptable to de- 
vote several seconds or minutes to analyze a matrix (or a 
group of matrices) to find these parameters. It is also very 
likely that a particular sparse matrix format, CMRS or 
HYB, together with its best optimization parameters, will 
be adequate to all matrices appearing in a given problem, 
so that detailed matrix analysis would have to be done 
only once. 

5.6. Comparison with Tesla C2070 

Tesla C2070 is a GPU designed specifically for applica- 
tions in high performance computing. Compared to GTX 
480, which is a high-end gaming card, it delivers almost 3 
times faster theoretical computational performance in dou- 
ble precision and has almost 4 times larger memory, at the 
price of » 20% smaller memory bandwidth (see Tab. [T|). 
We calculated ?y + for all test matrices and found its value 
for C2070 to lay between 0.85 and 1.00 of the correspond- 
ing value for GTX 480. This shows to what extent the 
SpMV kernel is memory-bound: the awesome computing 
power of C2070 can not be exploited in this kernel and the 
only factor that really matters is the memory utilization, 
which turns out a bit better in GTX 480. 



5. 7. Single precision performance 

Since it is more difficult for GPUs to coalesce scat- 
tered memory transfers for single precision data [7J, the 
peak effective bandwidth achieved by GTX 480 in the sin- 
gle precision mode is lower than for double precision (216 
GB/s for matrix nd24k). However, since doing SpMV in 
single precision requires far less bytes to be transferred, 
the computational power is in this case generally larger, 
peaking at 36 GFLOPS for nd24k. The performance of 
our CMRS implementation relative to others tends to be 
even better in single than in double precision. Our method 
turned out the fastest by at least 10% in 58 cases, the max- 
imum speedup is 63%, and it is on average 20% faster than 
Nvidia HYB implementation. We also found that in single 
precision the CMRS format can give a significant speedup 
of the SpMV kernel even for matrices with /i as small as 
6. 

6. Conclusions and Outlook 

The new format for storing sparse matrices, CMRS, is 
designed specifically for optimizing the SpMV operation 
on modern throughput-oriented computer architectures. 
It has several features distinguishing it from other formats 
developed for the same purpose: (i) it is an extension of 
a popular format, CRS, with a quick and in-place conver- 
sion to and from it; (ii) it does not impose any artificial 
pressure on the device memory, in particular, it does not 
require zero-padding; (iii) it does not require any row or 
column reordering nor any other matrix preprocessing; (iv) 
for Fermi-class high-end GPUs it gives the most efficient 
SpMV product implementation for typical sparse matrices 
if the average number of their nonzero elements per row, ji, 
is greater than s=a 70 or when the standard deviation of the 
number of nonzero elements per row, a, is greater than 
3; it can allow for the most efficient implementation for 
some matrices satisfying /i < 70 and 1.5 < /z < 3 . These 
features should facilitate its adoption in existing software 
and open new possibilities for its further optimization, e.g. 
by combining it with other formats for better handling of 
atypical sparse matrices, e.g. lpl. Moreover, the fact that 
our CMRS-based implementation of the SpMV kernel of- 
ten approaches the hardware limit suggests that this for- 
mat will scale well into future GPU architectures and can 
prove efficient also in other throughput-oriented computer 
architectures. 

Our results reveal the importance of developing a rep- 
resentative collection (or collections) of sparse matrices for 
which the SpMV product is a truly relevant operation. We 
are aware that the two simple criteria we applied here (a 
matrix must be square and have at lest 10 6 nonzero ele- 
ments) do not lead to a 'representative' collection, as the 
UF SMC contains also very unusual matrices for which 
the SpMV product is very unlikely to be applicable, e.g. 
matrices with empty rows or columns. Such atypical ma- 
trices can obfuscate the general picture of the SpMV per- 
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formance and its dependence on the matrix format and 
techniques used to implement it. 

We also demonstrated the importance of general char- 
acteristics of SpMV product efficiency, e.g. the bandwidth 
efficiency. Such parameters enable to identify matrices for 
which further optimization is required as well as help de- 
termine the quality of hardware support for the SpMV 
operation. 
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Additional material 

This section contains an extended version of Tab. [5J with the data for all 138 sparse matrices from the Florida 
University Sparse Matrix Collection used in our study, including the value of a, obtained for Nvidia GTX 480 graphics 
card in double precision. As some of the test matrices in the original collection contain explicit zeroes, in their cases 
we provide the data for the matrix both without and with the explicit zeros included in the computer representation of 
the matrix. Matrices with explicit zeros are distinguished by adding (*) to their names, e.g. "ohne2 (*)". Matrices are 
ordered according to their "kind" , as given in the original collection. The following short forms of the kind were used 
for brevity: 



short form full problem name 



2D/3D 2-dimensional/3-dimensional problem 

CFD computational fluid dynamics problem 

circuit sim. circuit simulation problem 

Dgraph directed graph problem 

economic economic problem 

EM electromagnetics problem 

lstsq least squares problem 

linear prog. linear programming problem 

materials materials problem 

model red. model reduction problem 

optimization optimization problem 

power net. power network problem 

QCD quark propagators (QCD/LGT) problem 

qchcm. theoretical/quantum chemistry problem 

scmicond. semiconductor device problem 

structural structural problem 

thermal thermal problem 

Ugraph undirected graph problem 
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7 


127 


0.74 


1.66 


1.17 


1.64 


1.51 


4.32 


18.0 


184 


1.04 


0.63 


62 


model redu. 


boneSlO 


44.7 


1.68 


41 


915 


4.38 


1.64 


1.02 


2.00 


1.43 


4.57 


18.5 


188 


1.06 


0.65 


63 


model redu. 


boneSlO (*) 


60.6 


2.62 


55 


915 


5.68 


1.45 


— 


1.84 


1.29 


4.62 


19.4 


197 


1.11 


0.68 


64 


model redu. 


filter3D 


25.4 


2.51 


3 


106 


0.41 


1.97 


1.15 


1.67 


1.94 


3.09 


13.0 


134 


0.76 


0.47 


65 


optimization 


largebasis 


11.9 


0.29 


5 


440 


1.09 


1.28 


0.56 


1.20 


2.54 


1.11 


9.2 


99 


0.56 


0.36 


66 


optimization 


kkt_power 


6.2 


2.60 


13 


2063 


3.77 


1.07 


1.26 


1.31 


2.85 


1.23 


6.2 


72 


0.41 


0.28 


67 


optimization 


kkt_power (*) 


7.1 


2.78 


15 


2063 


4.08 


1.02 


1.44 


1.34 


2.66 


1.56 


6.7 


76 


0.43 


0.29 


68 


optimization 


lpl 


3.1 


194.74 


2 


534 


3.26 


11.93 


0.11 


4.33 


0.95 


12.54 


0.8 


12 


0.07 


0.05 


69 


optimization 


nlpkktl20 


26.9 


0.58 


95 


3542 


11.78 


1.78 


— 


2.16 


1.71 


3.74 


15.8 


164 


0.93 


0.58 


70 


optimization 


nlpkktl20 (*) 


27.3 


0.59 


97 


3542 


11.96 


1.76 


— 


1.93 


1.69 


3.73 


15.9 


165 


0.93 


0.58 



Id 




matrix properties 






SpMV 




speedup against 




computat. 


memory util. 




kind 


name 


V 


a 


nnz 

[xlO 6 ] 


rows 

[xlO 3 ] 


time 

[ms] 


CRS 


HYB 


CUSP 


vect. 


seal. 


perform. 
[GFLOPS] 


bandw. 

[GB/s] 


V- 


n+ 


71 


optimization 


nlpkkt80 


26.5 


0.71 


28 


1062 


3.62 


1.85 


0.70 


1.91 


1.79 


3.60 


15.3 


158 


0.89 


0.56 


72 


power net. 


TSOPF_FS.b300.cl 


150.6 


55.10 


4 


29 


0.50 


1.07 


2.28 


1.11 


0.97 


26.25 


17.5 


176 


1.00 


0.60 


73 


power net. 


TSOPF_FS.b300.c2 


154.3 


77.45 


9 


57 


0.96 


1.08 


2.32 


1.15 


0.98 


28.16 


18.1 


183 


1.03 


0.62 


74 


power net. 


TSOPF_FS.b300.c3 


155.6 


94.69 


13 


84 


1.41 


1.09 


2.37 


1.22 


0.98 


28.83 


18.5 


187 


1.05 


0.64 


75 


power net. 


TSOPF_RS.b2383 


424.2 


23.51 


16 


38 


1.32 


1.19 


3.16 


1.10 


1.08 


6.50 


24.5 


245 


1.39 


0.83 


76 


power net. 


TSOPF_RS.b2383.cl 


424.2 


23.51 


16 


38 


1.32 


1.19 


3.16 


1.10 


1.08 


6.69 


24.5 


246 


1.39 


0.83 


77 


QCD 


qcd5_4 


39.0 


7e-4 


2 


49 


0.23 


1.98 


0.80 


1.58 


1.74 


4.90 


16.4 


168 


0.95 


0.58 


78 


qchem. 


Gal9Asl9H42 


66.7 


12.63 


9 


133 


1.03 


1.48 


1.57 


1.39 


1.28 


4.00 


17.2 


174 


0.98 


0.60 


79 


qchem. 


CO 


34.7 


1.20 


8 


221 


0.91 


2.06 


0.85 


1.88 


1.93 


4.77 


16.6 


171 


0.97 


0.59 


80 


qchem. 


GalOAslOH30 


54.1 


11.20 


6 


113 


0.73 


1.58 


1.37 


1.46 


1.41 


3.97 


16.6 


169 


0.96 


0.58 


81 


qchem. 


Ga3As3H12 


97.3 


22.77 


6 


61 


0.65 


1.32 


1.97 


1.27 


1.13 


4.43 


18.2 


184 


1.04 


0.63 


82 


qchem. 


Ga41As41H72 


69.0 


12.69 


18 


268 


2.08 


1.51 


1.60 


1.48 


1.29 


4.12 


17.7 


179 


1.01 


0.62 


83 


qchem. 


Ge87H76 


69.9 


10.79 


8 


113 


0.89 


1.51 


1.65 


1.45 


1.31 


4.33 


17.7 


179 


1.01 


0.62 


84 


qchem. 


Ge99H100 


74.8 


10.93 


8 


113 


0.94 


1.48 


1.72 


1.44 


1.28 


4.32 


17.9 


181 


1.03 


0.62 


85 


qchem. 


Si34H36 


52.9 


9.48 


5 


98 


0.61 


1.62 


1.42 


1.51 


1.43 


4.05 


16.7 


170 


0.96 


0.59 


86 


qchem. 


Si41Ge41H72 


80.9 


14.12 


15 


186 


1.59 


1.5 


1.83 


1.47 


1.28 


4.41 


18.8 


190 


1.07 


0.65 


87 


qchem. 


S187H76 


44.4 


5.96 


11 


240 


1.23 


1.83 


1.17 


1.73 


1.66 


4.21 


17.2 


175 


0.99 


0.61 


88 


qchem. 


Si02 


72.6 


34.51 


11 


155 


1.27 


1.43 


1.66 


1.51 


1.28 


4.37 


17.6 


178 


1.01 


0.61 


89 


semicond. 


ohne2 


37.9 


3.91 


7 


181 


0.84 


1.89 


1.31 


1.90 


1.72 


4.42 


16.1 


166 


0.93 


0.57 


90 


scmicond. 


ohne2 (*) 


61.0 


2.70 


11 


181 


1.18 


1.63 


1.43 


1.73 


1.45 


5.30 


18.6 


189 


1.07 


0.65 


91 


semicond. 


para-4 


19.1 


3.94 


3 


153 


0.50 


2.17 


1.09 


2.11 


2.21 


4.05 


11.5 


121 


0.68 


0.43 


92 


semicond. 


para-4 (*) 


34.8 


3.37 


5 


153 


0.70 


1.87 


1.41 


1.81 


1.76 


4.43 


15.0 


155 


0.87 


0.54 


93 


semicond. 


barrier2-ll 


18.7 


6.17 


2 


116 


0.40 


2.04 


1.03 


2.04 


2.07 


6.97 


10.4 


110 


0.62 


0.39 


94 


semicond. 


barrier2-ll (*) 


33.7 


4.90 


4 


116 


0.54 


1.82 


1.40 


1.80 


1.71 


6.61 


14.3 


147 


0.83 


0.51 


95 


structural 


Fault_639 


42.7 


1.40 


27 


639 


3.18 


1.64 


0.83 


1.46 


1.46 


4.81 


17.0 


173 


0.98 


0.60 


96 


structural 


apache2 


6.7 


0.17 


5 


715 


1.22 


0.89 


0.39 


0.79 


3.33 


0.63 


7.3 


84 


0.47 


0.32 


97 


structural 


Emilia_923 


43.7 


0.99 


40 


923 


4.52 


1.65 


0.85 


1.97 


1.45 


5.13 


17.7 


181 


1.02 


0.63 


98 


structural 


m_tl 


100.0 


2.86 


10 


98 


0.91 


1.41 


1.39 


1.48 


1.30 


5.19 


21.4 


216 


1.22 


0.74 


99 


structural 


thread 


149.5 


3.85 


4 


30 


0.42 


1.25 


1.54 


1.22 


1.06 


5.53 


21.0 


211 


1.19 


0.72 


100 


structural 


xl04 


80.4 


3.74 


9 


108 


0.87 


1.46 


1.27 


1.68 


1.36 


4.72 


19.9 


202 


1.14 


0.69 


101 


structural 


pwtk 


52.9 


0.75 


12 


218 


1.11 


1.88 


0.91 


1.85 


1.63 


6.13 


20.5 


209 


1.18 


0.72 


102 


structural 


af_0_kl01 


34.8 


0.21 


18 


504 


2.00 


1.93 


0.76 


2.01 


1.8 


5.02 


17.3 


178 


1.00 


0.62 


103 


structural 


afjshelllO 


34.7 


0.17 


52 


1508 


5.99 


1.82 


0.75 


2.06 


1.65 


5.04 


17.2 


177 


1.00 


0.61 


104 


structural 


audikw_l 


82.3 


4.68 


78 


944 


7.45 


1.38 


— 


1.81 


1.22 


5.51 


20.7 


209 


1.18 


0.72 


105 


structural 


bmw3_2 


49.7 


1.96 


11 


227 


1.19 


1.80 


0.98 


1.77 


1.59 


4.83 


18.8 


192 


1.08 


0.66 



Id 






matrix properties 






SpMV 




speedup against 
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memory util. 
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M 


a 


nnz 

[xlO 6 ] 


rows 

[xlO 3 ] 


time 

[his] 


CRS 


HYB 


CUSP 


vect. 


seal. 


perform. 
[GFLOPS] 
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[GB/s] 


V- 


V+ 


106 


structural 


bmw7st_l 


51.8 


1.78 


7 


141 


0.78 


1.75 


1.00 


1.69 


1.55 


4.86 


18.7 


190 


1.07 


0.66 


107 


structural 


bmwcra_l 


71.5 


2.19 


11 


149 


1.07 


1.52 


1.01 


1.57 


1.36 


5.30 


19.7 


200 


1.13 


0.69 


108 


structural 


Chebyshev4 


78.9 


119.46 


5 


68 


1.68 


1.24 


0.67 


0.80 


0.55 


12.20 


6.3 


64 


0.36 


0.22 


109 


structural 


crankseg_l 


201.0 


6.25 


11 


53 


0.95 


1.23 


1.57 


1.28 


1.09 


5.88 


22.4 


225 


1.27 


0.77 


110 


structural 


crankseg_2 


221.6 


6.44 


14 


64 


1.24 


1.22 


1.52 


1.29 


1.08 


5.96 


22.7 


228 


1.29 


0.78 


111 


structural 


Fl 


78.1 


4.62 


27 


344 


2.79 


1.46 


1.51 


1.76 


1.33 


5.61 


19.1 


194 


1.09 


0.66 


112 


structural 


F2 


74.0 


4.36 


5 


72 


0.57 


1.50 


1.35 


1.52 


1.33 


5.25 


18.4 


187 


1.05 


0.64 


113 


structural 


Geo_1438 


41.9 


1.54 


60 


1438 


6.86 


1.63 


— 


2.06 


1.44 


4.83 


17.3 


177 


1.00 


0.61 


114 


structural 


Geo_1438 (*) 


43.9 


0.66 


63 


1438 


7.06 


1.62 


— 


2.03 


1.41 


5.13 


17.7 


181 


1.02 


0.63 


115 


structural 


hood 


44.9 


2.31 


10 


221 


1.12 


1.80 


1.17 


1.82 


1.60 


4.28 


17.5 


179 


1.01 


0.62 


116 


structural 


Hook_1498 


39.6 


2.38 


59 


1498 


6.87 


1.60 


— 


2.00 


1.40 


3.97 


17.1 


175 


0.99 


0.61 


117 


structural 


Hook_1498 (*) 


40.7 


2.19 


61 


1498 


6.98 


1.58 


— 


1.98 


1.39 


4.03 


17.2 


176 


1.00 


0.61 


118 


structural 


inline_l 


73.1 


4.17 


37 


504 


3.77 


1.42 


1.32 


1.72 


1.28 


5.31 


19.4 


197 


1.11 


0.67 


119 


structural 


ldoor 


44.6 


2.21 


42 


952 


4.67 


1.62 


1.13 


2.03 


1.41 


4.38 


18.0 


184 


1.04 


0.64 


120 


structural 


msdoor 


46.1 


2.02 


19 


416 


2.12 


1.70 


1.08 


1.88 


1.52 


4.49 


17.9 


183 


1.03 


0.63 


121 


structural 


Serena 


46.1 


1.46 


64 


1391 


7.12 


1.56 


— 


2.04 


1.36 


4.73 


17.8 


182 


1.03 


0.63 


122 


structural 


Serena (*) 


46.4 


1.36 


65 


1391 


7.14 


1.56 


— 


2.04 


1.35 


4.76 


17.9 


182 


1.03 


0.63 


123 


structural 


ship_001 


111.6 


5.03 


4 


35 


0.39 


1.34 


1.64 


1.35 


1.20 


5.12 


20.1 


203 


1.15 


0.69 


124 


structural 


ship_001 (*) 


133.0 


4.79 


5 


35 


0.44 


1.30 


1.52 


1.29 


1.17 


5.20 


20.9 


211 


1.19 


0.72 


125 


structural 


shipsecl 


25.3 


2.12 


4 


141 


0.50 


2.06 


1.15 


1.95 


2.06 


3.15 


13.9 


144 


0.82 


0.51 


126 


structural 


shipsecl (*) 


55.5 


1.49 


8 


141 


0.85 


1.62 


0.98 


1.56 


1.43 


4.42 


18.1 


185 


1.04 


0.64 


127 


structural 


shipsec8 


28.8 


2.40 


3 


115 


0.45 


1.99 


1.20 


1.68 


1.92 


3.39 


14.6 


151 


0.85 


0.53 


128 


structural 


shipsec8 (*) 


57.9 


1.88 


7 


115 


0.70 


1.65 


1.09 


1.61 


1.46 


4.75 


18.7 


190 


1.08 


0.66 


129 


structural 


smc3Dc 


73.3 


4.32 


3 


43 


0.48 


1.28 


1.34 


1.32 


1.23 


4.98 


13.0 


132 


0.75 


0.45 


130 


structural 


shipsec5 


25.6 


2.15 


5 


180 


0.64 


2.08 


1.11 


1.84 


2.04 


3.20 


14.2 


147 


0.83 


0.52 


131 


thermal 


FEM_3D_thermal2 


23.6 


0.92 


3 


148 


0.47 


2.13 


0.77 


2.03 


2.18 


3.03 


14.5 


150 


0.85 


0.53 


132 


thermal 


thermal2 


7.0 


0.31 


9 


1228 


2.13 


1.22 


0.71 


0.85 


3.15 


0.81 


7.5 


85 


0.48 


0.33 


133 


thermal 


thermomech_dK 


13.9 


0.38 


3 


204 


0.59 


1.25 


0.76 


1.01 


2.35 


1.70 


9.4 


100 


0.57 


0.36 


134 


Ugraph 


asia_osm 


2.1 


0.33 


25 


11951 


14.27 


0.59 


0.34 


0.43 


4.30 


0.29 


2.7 


43 


0.24 


0.20 


135 


Ugraph 


human_genel 


1107.1 


42.35 


25 


22 


2.72 


1.43 


2.26 


1.43 


1.26 


6.40 


18.2 


182 


1.03 


0.62 


136 


Ugraph 


mouse_gene 


642.3 


33.80 


29 


45 


3.70 


1.38 


1.96 


1.42 


1.26 


5.30 


15.6 


157 


0.88 


0.53 


137 


Ugraph 


human ^gene2 


1260.0 


38.74 


18 


14 


1.91 


1.41 


2.28 


1.50 


1.25 


6.67 


18.9 


189 


1.07 


0.64 


138 


Ugraph 


pdblHYS 


119.3 


2.92 


4 


36 


0.41 


1.31 


1.34 


1.34 


1.19 


5.44 


21.1 


213 


1.20 


0.73 



